class: center, middle, inverse, title-slide # The language of models ##
Data Science & Statistics ### Gavin McNicol ### 2021-11-14 --- class: middle # What is a model? --- ## Modelling - Use models to explain the relationship between variables and to make predictions - For now we will focus on **linear** models (but remember there are *many* *many* other types of models too!) .pull-left[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="data:image/png;base64,#u4-d01-language-of-models_files/figure-html/unnamed-chunk-1-1.png" width="75%" /> ] .pull-right[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="data:image/png;base64,#u4-d01-language-of-models_files/figure-html/unnamed-chunk-2-1.png" width="75%" /> ] --- # The breathing of the biosphere <img src="data:image/png;base64,#img/monroe-landscape.jpeg" width="65%" /> Photo Credit: Tyler Roman --- ## Data: Morgan-Monroe State Forest Carbon Uptake ```r monroe_fluxes <- read_csv("data/amf-us-mms.csv", na = c("-9999"), skip = 2) ``` - Source: AmeriFlux Network - Site Investigators: Kim Novick and Rich Phillips (Indiana University) - 197568 half-hourly measurements of ecosystem-atmosphere exchanges - That's 11.2767123 years of data --- # Data collection: Tower-based eddy covariance <img src="data:image/png;base64,#img/monroe-tower.jpeg" width="65%" /> Photo Credit: Tyler Roman **YouTube** --- # What does the tower data look like? ```r glimpse(monroe_fluxes) ``` ``` ## Rows: 197,568 ## Columns: 9 ## $ timestamp <dbl> 1.99901e+11, 1.99901e+11, 1.99901e+11, 1.99901e+11, 1.99901e… ## $ fco2 <dbl> 0.063, -0.038, -0.037, 0.091, -0.628, 0.450, 0.120, NA, NA, … ## $ ustar <dbl> 0.344, 0.247, 0.255, 0.227, 0.350, 0.287, 0.179, 0.087, 0.14… ## $ ta <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ le <dbl> -0.350, -0.321, -0.786, -0.191, -0.650, -0.297, -0.284, -0.0… ## $ p <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, … ## $ sw_in <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 47.46, 133.6… ## $ rh <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ pa <dbl> 99.0, 99.0, 99.0, 99.1, 99.2, 99.2, 99.3, 99.3, 99.4, 99.5, … ``` -- ```r monroe_fluxes$timestamp[1] %>% as.character() ``` ``` ## [1] "199901010100" ``` ```r monroe_fluxes$timestamp[197568] %>% as.character() ``` ``` ## [1] "202107160000" ``` -- **197,568 half hours from Jan 01, 1999 at 1AM to July 16, 2021 at midnight!** --- ## Process we're interested in: ### Ecosystem CO2 flux (fco2) (exchange rate of CO2 between ecosystem and atmosphere) <img src="data:image/png;base64,#img/ecosystem-exchange-2.png" width="40%" /> --- # What does CO2 flux look like in our Morgan Monroe State Forest? Try plotting a histogram of `fco2` ```r monroe_fluxes %>% ggplot(aes(fco2)) + geom_histogram() + labs(x = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` --- # What does CO2 flux look like in our Morgan Monroe State Forest? ``` ## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`. ``` ``` ## Warning: Removed 36179 rows containing non-finite values (stat_bin). ``` <img src="data:image/png;base64,#u4-d01-language-of-models_files/figure-html/histogram-2-1.png" width="50%" /> --- # What does CO2 flux look like in our Morgan Monroe State Forest? First, convert to date-time ```r monroe_fluxes <- monroe_fluxes %>% * mutate(date = ymd_hm(timestamp)) %>% relocate(date) %>% select(-timestamp) head(monroe_fluxes, 5) ``` ``` ## # A tibble: 5 × 9 ## date fco2 ustar ta le p sw_in rh pa ## <dttm> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 1999-01-01 01:00:00 0.063 0.344 NA -0.35 0 0 NA 99 ## 2 1999-01-01 02:00:00 -0.038 0.247 NA -0.321 0 0 NA 99 ## 3 1999-01-01 03:00:00 -0.037 0.255 NA -0.786 0 0 NA 99 ## 4 1999-01-01 04:00:00 0.091 0.227 NA -0.191 0 0 NA 99.1 ## 5 1999-01-01 05:00:00 -0.628 0.35 NA -0.65 0 0 NA 99.2 ``` --- # What does an ecosystem's breathing look like in our Morgan Monroe State Forest? Try plotting `fco2` over time ```r monroe_fluxes %>% mutate(sign = ifelse(fco2 >= 0, "net emission", "net uptake")) %>% ggplot(aes(date, fco2, color = sign)) + geom_point() + theme_bw() + labs(y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + scale_color_manual(values = c("red", "dark green")) ``` --- # What does an ecosystem's breathing look like in our Morgan Monroe State Forest? <!-- --> --- # How does the data look for one year? (keep it simple!) ```r monroe_fluxes %>% * filter(date > "2020-01-01" & date < "2021-01-01") %>% mutate(sign = ifelse(fco2 >= 0, "net emission", "net uptake")) %>% ggplot(aes(date, fco2, color = sign)) + geom_point() + theme_bw() + labs(y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + scale_color_manual(values = c("red", "dark green")) ``` --- # How does the data look for one year? (keep it simple!) <!-- --> **YouTube Video** --- # What about just for one 3-day period? (super simple!) ```r monroe_fluxes %>% * filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(date, fco2)) + geom_hline(yintercept = 0) + geom_point() + geom_line(aes(date, fco2)) + theme_bw() + labs(y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + scale_color_manual(values = c("red", "dark green")) ``` --- # What about just for one 3-day period? (super simple!) <!-- --> We now have a good handle on what our data is doing. Next, we might ask: **Why does it vary in this way?** --- class: middle # Modeling the relationship between variables --- ## Models as functions - We can represent relationships between variables using **functions** - A function is a mathematical concept: the relationship between an output and one or more inputs - Plug in the inputs and receive back the output - Example: The formula `\(y = 3x + 7\)` is a function with input `\(x\)` and output `\(y\)`. If `\(x\)` is `\(5\)`, `\(y\)` is `\(22\)`, `\(y = 3 \times 5 + 7 = 22\)` --- ## fco2 as a function of... --- ## fco2 as a function of incoming solar radiation ```r monroe_fluxes %>% * filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + geom_smooth(method = "lm") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` --- ## fco2 as a function of incoming solar radiation ```r monroe_fluxes %>% * filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + geom_smooth(method = "lm") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` <!-- --> --- ## ... without the measure of uncertainty ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + * geom_smooth(method = "lm", se = FALSE) + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` --- ## ... without the measure of uncertainty ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + * geom_smooth(method = "lm", se = FALSE) + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` <!-- --> --- ## ... with different cosmetic choices ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + geom_smooth(method = "lm", se = FALSE, * color = "purple", linetype = 2, size = 3) + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` --- ## ... with different cosmetic choices ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + geom_smooth(method = "lm", se = FALSE, * color = "purple", linetype = 2, size = 3) + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` <!-- --> --- ## Other smoothing methods: gam ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + * geom_smooth(method = "gam", se = FALSE, color = "purple") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")' ``` --- ## Other smoothing methods: gam ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + * geom_smooth(method = "gam", se = FALSE, color = "purple") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ s(x, bs = "cs")' ``` <!-- --> --- ## Other smoothing methods: loess ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + * geom_smooth(method = "loess", se = FALSE, color = "purple") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` --- ## Other smoothing methods: loess ```r monroe_fluxes %>% filter(date > "2020-07-01" & date < "2020-07-05") %>% ggplot(aes(x = sw_in, y = fco2)) + geom_point() + * geom_smooth(method = "loess", se = FALSE, color = "purple") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` --- ## Vocabulary - **Response variable:** Variable whose behavior or variation you are trying to understand, on the y-axis -- - **Explanatory variables:** Other variables that you want to use to explain the variation in the response, on the x-axis -- - **Predicted value:** Output of the **model function** - The model function gives the typical (expected) value of the response variable *conditioning* on the explanatory variables -- - **Residuals:** A measure of how far each case is from its predicted value (based on a particular model) - Residual = Observed value - Predicted value - Tells how far above/below the expected value each case is --- ## Residuals ```r co2_flux_fit <- linear_reg() %>% set_engine("lm") %>% fit(fco2 ~ sw_in, data = filter(monroe_fluxes, date > "2020-07-01" & date < "2020-07-05")) co2_flux_fit_tidy <- tidy(co2_flux_fit$fit) co2_flux_fit_aug <- augment(co2_flux_fit$fit) %>% mutate(res_cat = ifelse(.resid > 0, TRUE, FALSE)) ggplot(data = co2_flux_fit_aug) + geom_point(aes(x = sw_in, y = fco2, color = res_cat)) + geom_line(aes(x = sw_in, y = .fitted), size = 0.75, color = "#8E2C90") + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + guides(color = FALSE) + scale_color_manual(values = c("#260b27", "#e6b0e7")) + geom_text(aes(x = 0, y = 20), label = "Positive residual", color = "#e6b0e7", hjust = 0, size = 8) + geom_text(aes(x = 150, y = -30), label = "Negative residual", color = "#260b27", hjust = 0, size = 8) + theme_bw() ``` --- ## Residuals --- .question[ What if we looked at a model fit for an entire year of data? ] ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="data:image/png;base64,#u4-d01-language-of-models_files/figure-html/co2-flux-plot-alpha-1.png" width="40%" /> --- ## Why might radiation have no effect on net CO2 flux? -- CLUE: <img src="data:image/png;base64,#img/taylor.jpeg" width="65%" /> --- ## Because of seasonality <img src="data:image/png;base64,#img/monroe-summer.jpeg" width="35%" /> <img src="data:image/png;base64,#img/monroe-fall.jpeg" width="35%" /> --- ## Multiple explanatory variables .panelset[ .panel[.panel-name[Plot] .pull-left-narrow[ .question[ How might the relationship between sunlight and CO2 flux be affected by the season of the year? ] ] .pull-right-wide[ ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="data:image/png;base64,#u4-d01-language-of-models_files/figure-html/unnamed-chunk-11-1.png" width="40%" /> ] ] .panel[.panel-name[Code] ```r monroe_fluxes %>% filter(date > "2020-01-01" & date < "2021-01-01") %>% mutate(season = ifelse(date < "2020-04-01", "Winter", NA), season = ifelse(date < "2020-07-01" & date > "2020-04-01", "Spring", season), season = ifelse(date < "2020-10-01" & date > "2020-07-01", "Summer", season), season = ifelse(date < "2020-12-31" & date > "2020-10-01", "Fall", season)) %>% filter(season %in% c("Winter", "Summer")) %>% #<< (sorry T Swift!) ggplot( aes(x = sw_in, y = fco2, color = factor(season))) + geom_point(alpha = 0.4) + geom_smooth(method = "lm", se = FALSE) + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + scale_color_manual(values = c("dark green", "light blue")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ] ] --- ## Extending regression lines .panelset[ .panel[.panel-name[Plot] ``` ## `geom_smooth()` using formula 'y ~ x' ``` <img src="data:image/png;base64,#u4-d01-language-of-models_files/figure-html/unnamed-chunk-12-1.png" width="40%" /> ] .panel[.panel-name[Code] ```r monroe_fluxes %>% filter(date > "2020-01-01" & date < "2021-01-01") %>% mutate(season = ifelse(date < "2020-04-01", "Winter", NA), season = ifelse(date < "2020-07-01" & date > "2020-04-01", "Spring", season), season = ifelse(date < "2020-10-01" & date > "2020-07-01", "Summer", season), season = ifelse(date < "2020-12-31" & date > "2020-10-01", "Fall", season)) %>% filter(season %in% c("Winter", "Summer")) %>% ggplot( aes(x = sw_in, y = fco2, color = factor(season))) + geom_point(alpha = 0.4) + * geom_smooth(method = "lm", se = FALSE, fullrange = TRUE) + labs( title = "Half-hourly CO2 flux vs. incoming solar radiation", subtitle = "For 2020, at Morgan Monroe State Forest", x = expression("Radiation (W m"^{-2}*")"), y = expression("Net CO"[2]*" Flux (nmol m"^{-2}*" s"^{-1}*")")) + scale_color_manual(values = c("dark green", "light blue")) + theme_bw() ``` ``` ## `geom_smooth()` using formula 'y ~ x' ``` ] ] --- ## Models - upsides and downsides - Models can sometimes reveal patterns that are not evident in a graph of the data. This is a great advantage of modeling over simple visual inspection of data. - There is a real risk, however, that a model is imposing structure that is not really there on the scatter of data, just as people imagine animal shapes in the stars. A skeptical approach is always warranted. --- ## Variation around the model... is just as important as the model, if not more! *Statistics is the explanation of variation in the context of what remains unexplained.* - The scatter suggests that there might be other factors that account for large parts of painting-to-painting variability, or perhaps just that randomness plays a big role. - Adding more explanatory variables to a model can sometimes usefully reduce the size of the scatter around the model. (We'll talk more about this later.) --- ## How do we use models? - Explanation: Characterize the relationship between `\(y\)` and `\(x\)` via *slopes* for numerical explanatory variables or *differences* for categorical explanatory variables - Prediction: Plug in `\(x\)`, get the predicted `\(y\)`